ABSTRACT


The availability of large scientific
computers specially designed to solve
problems 100-1000 times faster than
current conventional processors will
shortly open new opportunities to
simulation-oriented research. This
paper presents the attributes of
problems commonly solved on such machines,
presents simplified mathematical models
and corresponding methods of evaluating
their performance, and gives results of
benchmark studies.


INTRODUCTION


Reading the titles of papers in this
conference, one sees a concentration on
the mathematics of simulation and its
application to a variety of economic,
social, environmental, and physical
systems. Only a few sessions have to do
with the tools of most simulation studies,
i.e., the analog, digital, and hybrid
computer.

This preoccupation is understandable in
part because present machines have
significant computational power and user
convenience, allowing discipline-oriented
users to be little concerned about design
characteristics of the computer.

Recent advances in technology and computer
design promise to significantly enlarge
the size of simulation problems solvable
in reasonable computation times. The new
processors - usually termed vector or
array processors - exploit regularity of
a problem structure to achieve
significantly faster computation speeds.

To particularize, recent benchmarks have
shown that existing vector processors
can achieve speeds nearly 100 times those
of the conventional scalar processors
found in most central computing facilities.
Furthermore, current studies are being
made of processors for the 1980's that
have 1000 times present conventional
speeds. The need for such high-speed
computation is quite clear in certain
scientific and engineering applications,
especially those involving 3-dimensional
and transient phenomena in physical systems
with coupled thermodynamic, fluid,
mechanical, and/or electrical effects.

This paper is intended to introduce the
simulation enthusiast in a discipline
not included in the above to the concepts
and potential advantage of vector
processing.

Before proceeding with the technical
discussion, it is worth considering
a common characteristic of most large
scale simulation problems.

To justify the need for massive
computation, simulation programs in general
require massive amounts of data on which
to perform the computation (the complexity
of most computation being limited to
O(n) where n is the number of data
elements). One source of such data is
automated measurement devices; another
is the algorithmic generation of large
data sets from small ones, as occurs in
the production of large matrices for
solution of partial differential equations
given only a small set of physical
dimensions and constants. Data sets - and
hence computation - dependent on personal
collection or other non-automated
production will inherently be limited in
size, due principally to the time for
collection and entry of the data into the
machine.

This algorithmic generation of large data
sets nearly always implies a regular
problem structure. Thus, large problems
inherently contain the solution structure
required for vector processing.

DEFINITIONS

The  rather  special  viewpoint  of  this  paper 
is  indicated  by  the  following  definitions. 

Scalar processor - a machine that
processes single and array data
elements with similar speed.

Vector processor - a machine that
processes array data elements at
a higher rate than single elements.

It should be noted that (a) the computer
architecture to achieve this speedup is
of no consequence, (b) the appearance of
array constructs in a language (such as
APL) is not related to the speedup issue,
and (c) the availability of high level
programming languages that exploit this
speed is assumed.


Such vector processing characteristics
would be indicated in Fortran (for example)
by the ability to process preferentially
either a

single loop

      DO 1 J=1,N
    1 A(J)=B(J)*C(J)

or a multiple loop

      DO 1 J=1,N
      DO 1 K=1,N
    1 A(J,K)=A(J,K)+B(J)*C(K)          (1)

in one vector operation.

One method of achieving this speedup is
through the use of parallel processors
(Figure (1)). It is easy to visualize that
the single loop above could be executed
in one parallel step by N processors.
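
As a rough modern illustration (Python is used
here only for convenience; the function names
are hypothetical and do not come from the
paper), the two Fortran loops above can be
written once element by element, as a scalar
processor would execute them, and once as
whole-array expressions of the kind a vector
processor would accept as a single operation.
Plain Python lists gain no speed from the
second form; only the shape of the operation
is being shown.

```python
def single_loop(b, c):
    """A(J) = B(J)*C(J) for J = 1..N, computed one element at a time."""
    a = [0.0] * len(b)
    for j in range(len(b)):
        a[j] = b[j] * c[j]
    return a


def single_vector(b, c):
    """The same result written as one whole-array expression."""
    return [bj * cj for bj, cj in zip(b, c)]


def multiple_loop(a, b, c):
    """A(J,K) = A(J,K) + B(J)*C(K), the rank-one update of Equation (1)."""
    return [[a[j][k] + b[j] * c[k] for k in range(len(c))]
            for j in range(len(b))]
```

Both single-loop forms return the same array;
the difference of interest is that the second
names the whole operation at once, which is
what allows N processors to share it.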

MATHEMATICAL  DESCRIPTION 

Although not a universal mathematical
description of all vector processors, the
following model provides a basis for
deciding whether an algorithm is amenable
to vector processing.


In general, the time to perform a single-
or multi-loop operation is given by the
formula

    T = T_s + L \, T_{op}          (2)

where Ts is the time to start up the vector
operation, Top is the time to complete an
arithmetic (or logical) computation, and
L is the vector length (N in Equation (1)).
The corresponding vector timing diagram
is given in Figure (2). The comparison
with a seemingly slower scalar processor
is also shown. Note that the vector
startup is shown to make vector processing
potentially slower for short vectors; the
crossover point A occurs typically for
array lengths between 2 and 5.
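
The startup penalty can be made concrete with
a small numerical sketch. It assumes the vector
time model T = Ts + L*Top described above and,
as an added assumption not stated in the paper,
a scalar processor that takes a fixed time
t_scalar per element with no startup. The
function names and all numeric values are
illustrative only, not measured data.

```python
import math


def vector_time(length, t_startup, t_op):
    """Time for one vector operation: startup plus t_op per element."""
    return t_startup + length * t_op


def scalar_time(length, t_scalar):
    """Assumed scalar model: no startup, t_scalar per element."""
    return length * t_scalar


def crossover_length(t_startup, t_op, t_scalar):
    """Smallest integer length at which the vector operation is strictly faster."""
    if t_scalar <= t_op:
        raise ValueError("vector processing never wins unless t_scalar > t_op")
    # Solve t_startup + L*t_op < L*t_scalar for the smallest integer L.
    return math.floor(t_startup / (t_scalar - t_op)) + 1


if __name__ == "__main__":
    # Arbitrary time units; Ts/Top = 30 lies in the 10-100 range quoted later.
    print(crossover_length(t_startup=30.0, t_op=1.0, t_scalar=10.0))
```

With these assumed values the vector operation
first wins at a length of 4, inside the range
of 2 to 5 quoted above.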

ALGORITHM EVALUATION

To decide whether a particular algorithm
is amenable to vectorized solution, the
following compact measure is introduced.


Assume that the algorithm involves only
m vector operations, the ith operation
having length Li, startup time Ts, and
arithmetic operation time Top. Since
time Top is the useful computation time,
define the vectorization efficiency as

    \text{efficiency}
        = \frac{\text{arithmetic time}}{\text{startup time} + \text{arithmetic time}}
        = \frac{L_{ave}}{T_s / T_{op} + L_{ave}}

where

    L_{ave} = \frac{1}{m} \sum_{i=1}^{m} L_i

is the average vector length. Note that
(a) Ts/Top is a processor parameter
(typically between 10 and 100) whereas
Lave is a characteristic of the algorithm
only;

(b) a large Lave yields a higher efficiency;

(c) when Lave = Ts/Top, one half of the
computation time is devoted to useful
(arithmetic) computation.
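
The efficiency measure can be checked
numerically. The sketch below assumes that each
of the m vector operations costs Ts + Li*Top,
so that the efficiency is
(Top * sum of Li) / (m*Ts + Top * sum of Li),
which reduces to Lave / (Lave + Ts/Top). The
function names and the sample vector lengths
are hypothetical.

```python
def average_vector_length(lengths):
    """L_ave: mean length of the m vector operations in the algorithm."""
    return sum(lengths) / len(lengths)


def vectorization_efficiency(lengths, t_startup, t_op):
    """Arithmetic time divided by (startup time + arithmetic time)."""
    arithmetic = t_op * sum(lengths)       # useful computation
    startup = t_startup * len(lengths)     # one startup per vector operation
    return arithmetic / (startup + arithmetic)


if __name__ == "__main__":
    lengths = [10, 20, 30]  # hypothetical algorithm with L_ave = 20
    # Ts/Top equal to L_ave: half of the time is useful arithmetic.
    print(vectorization_efficiency(lengths, t_startup=20.0, t_op=1.0))
    # A processor with a larger Ts/Top is less efficient on the same algorithm.
    print(vectorization_efficiency(lengths, t_startup=100.0, t_op=1.0))
```

Because only the ratio Ts/Top and the average
length enter the result, one number per
algorithm and one per processor are enough to
estimate the efficiency.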

The latter property allows a compact
representation of the vectorizability
of competing algorithms or of a single
algorithm applied to a family of problems.
For example, in (3) a computationally
efficient scheme for solving families of
finite element problems arising from
partial differential equation solutions
is evaluated for its vector characteristics.
In Figure (3), the average vector length
is plotted versus grid size for several
families of finite elements. It is
immediately clear that, whereas case A
has marginal vector characteristics for
10 ≤ Ts/Top ≤ 100, case B is an excellent
candidate for solution on any vector
processor.

PROCESSOR/ALGORITHM  EVALUATION 


Once an algorithm is judged amenable to
vector processing, a means of evaluating
its performance on specific vector
processors is required. The idealizations
inherent in the characterization of an
algorithm by Lave are now removed. For
example, a study of the timings of
vectorized algorithms shows that appreciable
time may be spent on unavoidable scalar
operations which preamble the vector
portions of the code. Also, conflicts
may arise in the routing of data on a
particular machine.

Despite this deviation from idealized
performance, it is generally true that
larger problems - involving large array
operations - exploit the algorithm and
processor vector attributes most fully.

In the limit, as problem size grows to
infinity, the combined performance is
usually governed only by (a) the number
of arithmetic operations Nar, and (b) the
operation time Top. Thus, if n is the
problem size and T is the total computation
time, the operation rate R

    R(n) = \frac{N_{ar}(n)}{T(n)}

has the property that

    R(\infty) = \lim_{n \to \infty} \frac{N_{ar}(n)}{T_{op} \, N_{ar}(n)} = \frac{1}{T_{op}}

If R(n) were to be plotted versus n, this
representation would have three properties:

(a) an asymptotic value of 1/Top,

(b) an initial value R(0) = 0, due to
program overhead, and

(c) a usually monotonic increase to 1/Top.

The rate at which R(n) approaches its
asymptotic value gives an indication of
the performance of the algorithm/architecture
on small problems. An ideal
characteristic would be R(1) = 1/Top.
The next most favorable shape would be
a scalar processor characteristic, where
the startup time Ts in Equation (2) would
be zero.
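
The shape of R(n) described above can be
sketched with a deliberately simple model. It
assumes, beyond what the paper states, that a
problem of size n consists of a single vector
operation of length n preceded by a fixed
scalar overhead t_overhead, so that
T(n) = t_overhead + Ts + n*Top and the number
of arithmetic operations is n. The function
names and parameter values are illustrative.

```python
def operation_rate(n, t_overhead, t_startup, t_op):
    """R(n) = N_ar(n) / T(n) for the assumed one-vector-operation model."""
    total_time = t_overhead + t_startup + n * t_op
    return n / total_time


def half_rate_size(t_overhead, t_startup, t_op):
    """Problem size at which R(n) reaches half of its asymptote 1/t_op."""
    return (t_overhead + t_startup) / t_op


if __name__ == "__main__":
    params = dict(t_overhead=50.0, t_startup=30.0, t_op=1.0)
    for n in (0, 10, 80, 1000, 100000):
        print(n, round(operation_rate(n, **params), 4))
    print("half rate at n =", half_rate_size(**params))
```

In this model a smaller overhead or startup
time moves the half-rate point toward n = 0,
which is the sense in which a processor shows
better small-problem performance.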

Comparisons have been made of several of
the current commercial vector processors,
based on their floating point processing
rate in solving systems of simultaneous
linear equations, using two different
classes of algorithms (Figure (4)).

Although such benchmarks are subject to
qualification due to the different numeric
precision and different programming
languages involved (1), this display
clearly shows certain processors with
improved small-problem performance. This
observation has since been supported by
tests on a number of algorithms in
reference (2).

IMPACT ON USER

A review of current vector processors
reveals a distinction between "Fortran"
and "assembly language" processors. At
present, a design tradeoff can be made
between a machine with a variety of
vector-related hardware resources that
must be controlled from an assembly or
non-Fortran language, and machines which
can respond only to standard Fortran
constructs. Although clearly the "Fortran"
machine cannot be faster than the former,
it is unclear at this time what the
tradeoff will be in attempts to achieve
the very high processing rates of the
1980's.

IMPACT ON MODELING

Perhaps more in keeping with this
conference is the impact of machine architecture
on modeling. Although a regularity in
the problem structure is necessary, this
regularity may take several forms.

(1) In Figure (5a), a "sparse" system is
illustrated. Here, a large number of
small systems are loosely interconnected.
In this case, each vector (array) would
contain an element from each system, so
that an array operation would contribute
to the partial solution of every small
system.

(2) In contrast, the two systems of Figure
(5b) would be solved in sequence, on the
assumption that each system is sufficiently
dense to generate long vectors in its
solution (e.g., row operations used in
solution of a large matrix).
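
The data layout described in case (1) can be
sketched as follows. As an assumed stand-in for
the "small systems", the example uses many
independent 2x2 linear systems solved by
Cramer's rule. Each list holds the same
coefficient taken from every system, so every
arithmetic step is one array operation whose
length equals the number of systems. The helper
names are hypothetical, and the interconnection
between systems is ignored.

```python
def vmul(x, y):
    """Elementwise product: one vector operation across all systems."""
    return [a * b for a, b in zip(x, y)]


def vsub(x, y):
    """Elementwise difference across all systems."""
    return [a - b for a, b in zip(x, y)]


def vdiv(x, y):
    """Elementwise quotient across all systems."""
    return [a / b for a, b in zip(x, y)]


def solve_many_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] [x1, x2] = [b1, b2] for every system.

    Each argument is a list with one entry per small system, so the
    vector length is the number of systems, not the size of a system.
    """
    det = vsub(vmul(a11, a22), vmul(a12, a21))
    x1 = vdiv(vsub(vmul(b1, a22), vmul(a12, b2)), det)
    x2 = vdiv(vsub(vmul(a11, b2), vmul(b1, a21)), det)
    return x1, x2


if __name__ == "__main__":
    # Two small systems stored coefficient by coefficient.
    x1, x2 = solve_many_2x2([2, 1], [0, 1], [0, 1], [4, -1], [2, 3], [8, 1])
    print(x1, x2)
```

With a thousand small systems the same eleven
array operations would each have length 1000,
which is how a sparse collection of small
systems can still present long vectors.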

A more  subtle  distinction  between  these 
two  model  systems  occurs,  however,  when 
one  examines  in  detail  the  data  flow 
necessary  to  carry  out  their  solutions. 

In the sparse system of Figure (5a),
the small system solutions must be
combined according to the interconnection
pattern. Since the systems are themselves
quite small, the ratio of data movement
(corresponding to the interconnection)
to computation (system solution) is greater
than for the dense system. The routing
of data in a vector architecture of the
form of Figure (1) can be difficult under
any circumstances, and could easily
dominate the arithmetic computation on
current vector processors for such
"sparse" systems.

CONCLUSION 

The impact of vector processors on the
general simulation field cannot be
evaluated at this time. Although one is tempted
to note that, like hybrid computers,
vector machines will initially be available
to only a few research groups and will
require rather specialized programming
efforts, the analogy is difficult to carry
further for several reasons.

First, hybrid computer manufacturers are
themselves examining the possibility of
replacing hybrid configurations with
small vector processors architected to
solve common simulation problems
efficiently. In such an event, the class of
problems now analyzed with hybrid computers
would become a subset of problems
solvable on the new vector processors.


Second, it is reasonable to assume that
computer architects will strive to reduce
the impact of vector-related parameters
such as startup time on computational
performance and that system software
developers will reduce the problem of
user control of architectural features.

Figure 5. Illustration of sparse and dense systems


